Description of the project:

This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We will analyze variables in different categories and see differences between sexes. By focusing on longitudinal, cross-national data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.

The analysis of various indicators will allow us to examine temporal changes and potential external factors that might shape different educational outcomes. We selected three major indicator categories that reflect different dimensions of the education system: Enrollment rate vs Attendance rate We will compare enrollment and attendance rates across different levels of schooling (primary, secondary, tertiary) to examine whether higher enrollment leads to consistent attendance over time. Number of teachers vs Teaching staff compensation We will investigate how changes in staffing levels correspond with compensation levels across educational stages. Educational attainment rate We will track changes in the proportion of the population and see differences between sexes achieving different levels of education to understand broader societal progress and disparities. Motivation When we were brainstorming datasets to go with, we had difficulty choosing a topic because the possibilities were endless. We looked at several options from public health to climate to economics, but we eventually narrowed it down to education, which was something that all of us are connected to and that our classmates and readers could easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage. As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?


Structure:

We will perform exploratory data analysis (EDA) in R using the tidyverse suite of libraries. Our analysis will include: - Cleaning and reshaping selected indicators - Visualizing trends across time and education levels - Examining relationships among variables through summary statistics and correlation plots - Interpreting observed trends in relation to historical or policy-level changes such as shifts in federal funding


Visualizations and stats with code reading guide:


Code chunk #1:

Environment & Reproducibility The following chunk installs (if needed) and/or loads the all of the packages that will be used throughout the notebook. Declaring dependencies up front makes the notebook reproducible in a clean environment, and printing a small message confirms successful execution.

# install.packages("readr")
# install.packages("knitr")
# install.packages("tidyverse")
# install.packages("plotly")
# install.packages("patchwork")
# install.packages("rmarkdown")
library(readr)
library(knitr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.1.0
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(patchwork)
library(rmarkdown)

opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)

message("All packages loaded.")
## All packages loaded.

Code chunk #2:

Ingest & Initial Cleaning (USA Scope) The following chunk:
- Reads the raw World Bank file (EdStats_v01.csv).
- Filters to the United States (Country code == "USA").
- Drops the 2024 column because it is entirely missing.
- Writes the cleaned dataset to EdStats_USA.csv so later chunks start with a filtered version.

This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.

Build Analysis Subsets
We create two thematic subsets aligned to the research questions:

  1. Enrollment vs. Attendance
  • Keeps indicators with “total net enrolment rate” or “total net attendance rate.”
    • Note that “enrolment is misspelled, this is corrected later in the code
  • Excludes sex-specific series and parity indices.
  • Drops all-NA columns and saves as EdStats_attend.csv.
  1. Teachers vs. Compensation
  • Keeps indicators related to teacher counts and teaching staff compensation.
  • Excludes sex splits and qualification percentages.
  • Drops all-NA columns and saves as EdStats_teacher.csv.

This ensures we work with clean, directly comparable variables.

#set up data for use
data <- read_csv("EdStats_v01.csv")

#clean up data to have USA data
data_clean <- data %>%
  filter(grepl("USA", `Country code`)) %>%
  select(-`2024`) #all na
#make sure no NA cols remain
colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]
## character(0)
write_csv(data_clean, "EdStats_USA.csv")

#filter for enrollment/attendance info
data <- read_csv("EdStats_USA.csv")
data_filter_attend <- data %>%
  filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% view() %>%
  write_csv("EdStats_attend.csv")

#filter for num teachers and teaching staff compensation info
data <- read_csv("EdStats_USA.csv")
data_filter_teacher <- data %>%
  filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("female|male", `Indicator name`)) %>%
  filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
  select(where(~!all(is.na(.)))) %>% view() %>%
  write_csv("EdStats_teacher.csv")

Code chunk #3:

Enrollment vs. Attendance — Reshape, Visualize & Summarize
The following chunk:
- Reads EdStats_attend.csv and drops sparse early years.
- Uses pivot_longer to reshape data into tidy (Year, Value) format.
- Creates two new variables:
- Type: Attendance vs Enrollment
- Level: Primary, Middle (Lower Secondary), High (Upper Secondary)
- Produces faceted line charts comparing trends over time.
- Generates bar plots to inspect data coverage by variable, year, and level.
- Calculates summary statistics (mean, median, SD, quartiles, IQR) by Type × Level.
- Saves stats to tendency_attend.csv.

Reading tip: Look for patterns in the gap between enrollment and attendance. Use coverage plots to check where data is sparse.

#add scatter plot? correlation matrix?
data_attend <- read_csv("EdStats_attend.csv")

#select years with most observations
data_attend_years <- data_attend %>%
  select(!c(`1975`, `1986`, `1987`, `1990`, `1991`, `1993`, `1995`, `1996`, `1999`, `1994`))

#reshape data for visualization
data_long <- data_attend_years %>%
  pivot_longer(cols=`2005`:`2022`, names_to="Year", values_to="Value") %>%
  mutate(
    Year=as.numeric(Year),
    Type=case_when(
      grepl("attendance", `Indicator name`, ignore.case=TRUE) ~ "Attendance",
      grepl("enrolment", `Indicator name`, ignore.case=TRUE) ~ "Enrollment",
      TRUE ~ "Other"
    ),
    Level=case_when(
      grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
      grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
      grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
      TRUE ~ "Other"
    )
  ) %>%
  filter(Type!="Other", !is.na(Value))

#comparison plots
p_line_per_school_horizontal <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
  geom_line(size=1.2, na.rm=TRUE) +
  geom_point(size=2, alpha=0.8) +
  facet_grid(~Level, scales="fixed", switch="y") +
  theme_minimal() +
  theme(
    strip.background=element_rect(fill="grey90", color=NA),
    panel.spacing=unit(1, "lines")
  ) +
  labs(
    title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
    y="Rate (%)",
    color="Variable"
  ) +
  scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
  )
ggsave("line_per_school_horizontal.png", p_line_per_school_horizontal)

p_line_interactive1 <- ggplotly(p_line_per_school_horizontal)
p_line_interactive1
#supplemental plots about the data itself
#different colors per level
#add labels or plotly to these
#maybe shrink them
p1 <- ggplot(data_long, aes(x=Type, fill=Type)) +
  geom_bar() +
  theme_minimal() +
  labs(title="Count of Attendance vs Enrollment Observations",
       x="Variable",
       y="Count") 
# p1
ggsave("Observation_counts_variable_attend.png", p1)

p2<-ggplot(data_long, aes(x=factor(Year))) +
  geom_bar(fill="#c562f0") +
  theme_minimal() +
  labs(title="Number of Observations Per Year (All Variables)",
       x="Year",
       y="Count")
# p2
ggsave("Observation_counts_year_attend.png", p2)

p3<-ggplot(data_long, aes(x=Level, fill=Type)) +
  geom_bar(position="dodge") +
  theme_minimal() +
  labs(title="Number of Observations Per School Level",
       x="School Level",
       y="Count",
       fill="Type")
# p3
ggsave("Observation_counts_school_attend.png", p3)

# combine_supp_plots <- p1+p2+p3
# combine_supp_plots

#tendency measures
summary_stats <- data_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),                           
    Mean = mean(Value, na.rm=TRUE),
    Median = median(Value, na.rm=TRUE),
    SD = sd(Value, na.rm=TRUE),
    Variance = var(Value, na.rm=TRUE),
    Minimum = min(Value, na.rm=TRUE),
    Maximum = max(Value, na.rm=TRUE),
    Range = Maximum - Minimum,
    Q1 = quantile(Value, 0.25, na.rm=TRUE),
    Q3 = quantile(Value, 0.75, na.rm=TRUE),
    IQR = Q3 - Q1,
  ) %>%
  arrange(Type, Level)  

paged_table(summary_stats)
write_csv(summary_stats, "tendency_attend.csv")

Code chunk #4:

Teachers vs. Compensation — Reshape, Visualize & Summarize The following chunk:
- Reads EdStats_teacher.csv and filters out non-teaching staff data.
- Focuses on the most complete years (2014–2021).
- Reshapes data with pivot_longer and creates:
- Type: Number of Teachers vs Compensation (% of total expenditure)
- Level: Pre-Primary, Primary, Middle, High, Undergraduate, Post-Grad
- Scales teacher counts per 10,000 to make them visually comparable to percentages.
- Produces faceted line charts for side-by-side trend comparisons.
- Generates bar plots to examine data coverage by type, year, and level.
- Calculates summary statistics (mean, median, SD, quartiles, IQR) by Type × Level.
- Saves stats to tendency_teachers.csv.

Reading tip: Watch for whether compensation shares track with teacher counts over time, or if they diverge.

data_teacher <- read_csv("EdStats_teacher.csv") 

#select data with most observations and remove data not using yet
data_teacher <- data_teacher %>% 
  filter(!grepl("non-teaching staff", `Indicator name`, ignore.case = TRUE)) %>%
  filter(!grepl("upper-secondary", `Indicator name`, ignore.case = TRUE)) %>% #no teacher count
  select(!c(`1975`, `1986`, `1987`, `1990`, `1991`, `1993`, `1995`, `1996`, `1971`, `1972`, `1973`, `1974`, `1976`, `1977`, `1984`, `1985`, `1998`, `1992`, `1994`, `1998`, `2022`, `1988`))

#Set up data for visualization
data_teachers_long <- data_teacher %>%
  pivot_longer(cols=`2014`:`2021`, names_to="Year", values_to="Value") %>%
  mutate(
    Year = as.numeric(Year),
    Type = case_when(
      grepl("number", `Indicator name`, ignore.case=TRUE) ~ "Number of Teachers",
      grepl("compensation", `Indicator name`, ignore.case=TRUE) ~ "Compensation % of Total Expenditure",
      TRUE ~ "Other"
    ),
    Level = case_when(
      grepl("pre-primary", `Indicator name`, ignore.case=TRUE) ~ "Pre-School",
      grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
      grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
      grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
      grepl("secondary", `Indicator name`, ignore.case=TRUE) ~ "Undergraduate School",
      grepl("tertiary", `Indicator name`, ignore.case=TRUE) ~ "Post-Grad Schooling",
      TRUE ~ "Other"
    )
  ) %>%
  filter(Type != "Other", Level != "Other", !is.na(Value))

data_teachers_long <- data_teachers_long %>%
  mutate(Value_10k = ifelse(Type == "Number of Teachers", Value / 10000, Value))

data_teachers_long <- data_teachers_long %>%
  mutate(Level = factor(Level, levels = c(
    "Pre-School",
    "Elementary School",
    "Middle School",
    "High School",
    "Undergraduate School",
    "Post-Grad Schooling"
  )))

#comparison plots
#i think ill make seperate graphs cuz these r too squished
p_line_teacher <- ggplot(data_teachers_long, aes(x=Year, y=Value_10k, color=Type)) +
  geom_line(size=1.2, na.rm=TRUE) +
  geom_point(size=2, alpha=0.8) +
  facet_wrap(~Level, nrow=3, scales="fixed") +
  theme_minimal() +
  theme(
    strip.background=element_rect(fill="grey90", color=NA)
  ) +
  labs(title="Teachers & Compensation Trends Over Time (USA)",
       y="Variable",
       color="Variable") +
  scale_color_manual(
    values = c("Number of Teachers" = "#FF69B4",
               "Compensation % of Total Expenditure" = "#3bbf8f"),
    labels = c("Number of Teachers\nIn 10,000s", "Teacher Compensation % of\nTotal Instituation Expenditure")
  ) +
  scale_x_continuous(breaks = seq(2014, 2021, by = 2))
ggsave("line_teacherl.png", p_line_teacher)

p_teacher_interactive <- ggplotly(p_line_teacher)
p_teacher_interactive
#elemntary/middle/high vs undergrad/post-grad

#supplemental plots 
# count per Type
p1 <-ggplot(data_teachers_long, aes(x=Type, fill = Type)) +
  geom_bar() +
  theme_minimal() +
  labs(title="Count of Observations per Type", x="Variable", y="Count")
p1

ggsave("Observation_counts_variable_teacher.png", p1)

# count per Year
p2 <- ggplot(data_teachers_long, aes(x=factor(Year))) +
  geom_bar(fill="#c562f0") +
  theme_minimal() +
  labs(title="Number of Observations per Year", x="Year", y="Count")
p2

ggsave("Observation_counts_year_teacher.png", p2)

# count per Level
p3<-ggplot(data_teachers_long, aes(x=Level, fill=Type)) +
  geom_bar(position="dodge") +
  theme_minimal() +
  labs(title="Number of Observations per School Level", x="Level", y="Count", fill="Variable")
p3

ggsave("Observation_counts_school_teacher.png", p3)

#tendency
summary_stats_teachers <- data_teachers_long %>%
  group_by(Type, Level) %>%
  summarise(
    n = n(),                           
    Mean = mean(Value, na.rm=TRUE),
    Median = median(Value, na.rm=TRUE),
    SD = sd(Value, na.rm=TRUE),
    Variance = var(Value, na.rm=TRUE),
    Minimum = min(Value, na.rm=TRUE),
    Maximum = max(Value, na.rm=TRUE),
    Range = Maximum - Minimum,
    Q1 = quantile(Value, 0.25, na.rm=TRUE),
    Q3 = quantile(Value, 0.75, na.rm=TRUE),
    IQR = Q3 - Q1,
  ) %>%
  arrange(Type, Level)  

paged_table(summary_stats_teachers)
write_csv(summary_stats_teachers, "tendency_teachers.csv")

Interim Takeaways

These observations are preliminary and will guide further analysis.

Reproducibility Notes
All intermediate datasets and figures are written to disk (EdStats_USA.csv, EdStats_attend.csv, EdStats_teacher.csv, and exported PNGs).
This allows others to review and re-run later steps without repeating earlier processing.
Consider adding sessionInfo() at the end to document package versions.